The Human Activity Recognition database was built from recordings of 30 study participants performing activities of daily living (ADL) while carrying a waist-mounted smartphone with embedded inertial sensors. The objective is to classify each record into one of the six activities performed.
The experiments were carried out with a group of 30 volunteers aged 19 to 48 years. Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, we captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50Hz. The experiments were video-recorded so that the data could be labeled manually. The resulting dataset was randomly partitioned into two sets: 70% of the volunteers were selected to generate the training data and 30% the test data.
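The partition described above is made by volunteer rather than by record, so no participant contributes to both sets. A minimal base-R sketch of such a subject-wise split follows; the subject IDs 1 to 30, the seed, and the `subject_of_record` vector are assumptions for illustration and do not reproduce the original partition.

```r
# Sketch of a subject-wise 70/30 split (assumed IDs 1..30; arbitrary seed)
set.seed(42)
subjects <- 1:30
n_train <- round(0.7 * length(subjects))          # 21 volunteers for training
train_subjects <- sort(sample(subjects, n_train))
test_subjects <- setdiff(subjects, train_subjects) # remaining 9 volunteers

# Each record inherits the set of its volunteer.
# `subject_of_record` is a hypothetical per-record subject ID vector.
subject_of_record <- rep(subjects, each = 3)
is_train <- subject_of_record %in% train_subjects

length(train_subjects)
length(test_subjects)
```

Splitting at the volunteer level means the test score measures how well a model generalizes to people it has never seen.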
The sensor signals (accelerometer and gyroscope) were pre-processed by applying noise filters and then sampled in fixed-width sliding windows of 2.56 sec with 50% overlap (128 readings/window). The sensor acceleration signal, which has gravitational and body motion components, was separated using a Butterworth low-pass filter into body acceleration and gravity. The gravitational force is assumed to have only low-frequency components, so a filter with a 0.3 Hz cutoff frequency was used. From each window, a vector of features was obtained by calculating variables from the time and frequency domains.
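The window arithmetic above can be checked directly: 2.56 sec at 50 Hz gives 128 readings, and 50% overlap means consecutive windows start 64 readings apart. The base-R sketch below cuts a synthetic signal into such windows; the sine wave and its 10-second length are assumptions standing in for one sensor axis, and the noise and Butterworth filtering steps are not shown.

```r
fs <- 50                        # sampling rate in Hz (from the text)
win_len <- round(2.56 * fs)     # 128 readings per window
step <- win_len / 2             # 50% overlap -> windows start 64 readings apart

# Synthetic 10-second signal standing in for one accelerometer axis (assumption)
t <- (0:(10 * fs - 1)) / fs
signal <- sin(2 * pi * 1.5 * t)

# Start index of every complete window, then one column per window
starts <- seq(1, length(signal) - win_len + 1, by = step)
windows <- sapply(starts, function(s) signal[s:(s + win_len - 1)])

dim(windows)
```

Each column of `windows` would then be summarized by time-domain and frequency-domain statistics to form one feature vector.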
For each record in the dataset the following is provided:
Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. Human Activity Recognition on Smartphones using a Multiclass Hardware-Friendly Support Vector Machine. International Workshop of Ambient Assisted Living (IWAAL 2012). Vitoria-Gasteiz, Spain. Dec 2012
Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra, Jorge L. Reyes-Ortiz. Energy Efficient Smartphone-Based Activity Recognition using Fixed-Point Arithmetic. Journal of Universal Computer Science. Special Issue in Ambient Assisted Living: Home Care. Volume 19, Issue 9. May 2013
Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. Human Activity Recognition on Smartphones using a Multiclass Hardware-Friendly Support Vector Machine. 4th International Workshop of Ambient Assisted Living, IWAAL 2012, Vitoria-Gasteiz, Spain, December 3-5, 2012. Proceedings. Lecture Notes in Computer Science 2012, pp 216-223.
Jorge Luis Reyes-Ortiz, Alessandro Ghio, Xavier Parra-Llanas, Davide Anguita, Joan Cabestany, Andreu Català. Human Activity and Motion Disorder Recognition: Towards Smarter Interactive Cognitive Environments. 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2013. Bruges, Belgium 24-26 April 2013.
Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. A Public Domain Dataset for Human Activity Recognition Using Smartphones. 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2013. Bruges, Belgium 24-26 April 2013.
# Pre-load all R packages
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(h2o))
suppressPackageStartupMessages(library(plotly))
# Start and connect to an H2O cluster (JVM)
h2o.init(nthreads = -1)
H2O is not running yet, starting it now...
Note: In case of errors look at the following log files:
/tmp/Rtmp9896Ad/h2o_joe_started_from_r.out
/tmp/Rtmp9896Ad/h2o_joe_started_from_r.err
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-0ubuntu1.16.04.2-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
Starting H2O JVM and connecting: . Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 1 seconds 385 milliseconds
H2O cluster version: 3.10.5.1
H2O cluster version age: 10 days
H2O cluster name: H2O_started_from_R_joe_knw156
H2O cluster total nodes: 1
H2O cluster total memory: 5.21 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
R Version: R version 3.4.0 (2017-04-21)
h2o.no_progress() # disable progress bar in notebook
# Check if the datasets exist (locally)
chk_train <- suppressMessages(file.exists("./data/train.csv.gz"))
chk_test <- suppressMessages(file.exists("./data/test.csv.gz"))
# Import datasets (locally)
if (chk_train) hex_train <- h2o.importFile("./data/train.csv.gz")
if (chk_test) hex_test <- h2o.importFile("./data/test.csv.gz")
# Import datasets (from GitHub if they are not available locally)
if (!chk_train) hex_train <- h2o.importFile("https://github.com/woobe/h2o_demo_for_ibm_dsx/blob/master/data/train.csv.gz?raw=true")
if (!chk_test) hex_test <- h2o.importFile("https://github.com/woobe/h2o_demo_for_ibm_dsx/blob/master/data/test.csv.gz?raw=true")
# Dimensions
# 'Train' dataset has 7352 rows and 562 columns
# 'Test' dataset has 2947 rows and 562 columns
dim(hex_train)
[1] 7352 562
dim(hex_test)
[1] 2947 562
# First few records
# First column is the label 'activity'
# Rest of the columns (561 features, named like `f1_tBodyAccmeanX`) are sensor data
head(hex_train)
head(hex_test)
# Look at 'activity' column
# Six classes (Cardinality = 6)
# No missing values
h2o.describe(hex_train$activity)
h2o.describe(hex_test$activity)
# Extract 'activity' columns for other graphics packages in R
d_activity_train <- as.data.frame(hex_train$activity)
d_activity_test <- as.data.frame(hex_test$activity)
# Count records per activity
d_freq_train <- as.data.frame(table(d_activity_train))
d_freq_test <- as.data.frame(table(d_activity_test))
d_freq <- merge(d_freq_train, d_freq_test, by.x = "d_activity_train", by.y = "d_activity_test", sort = FALSE)
colnames(d_freq) <- c("activity", "freq_train", "freq_test")
d_freq
# Visualize 'activity' in both 'train' and 'test'
p <- plot_ly(d_freq, x = ~activity, y = ~freq_train, type = 'bar', name = 'Frequency (Train)') %>%
add_trace(y = ~freq_test, name = 'Frequency (Test)') %>%
layout(title = "Activities in 'Train' and 'Test' Dataset") %>%
layout(yaxis = list(title = 'Count'), xaxis = list(title = "")) %>%
layout(margin = list(b = 90)) %>%
layout(barmode = "group")
p
# Look at relationship between sensor data `f1_tBodyAccmeanX` and activity
d_f1 <- data.frame(V1_train = as.data.frame(hex_train$f1_tBodyAccmeanX), activity = as.data.frame(hex_train$activity))
head(d_f1)
p <- plot_ly(d_f1, y = ~f1_tBodyAccmeanX, color = ~activity, type = "box") %>%
layout(title = "Relationship between Sensor Data `f1_tBodyAccmeanX` and Activities") %>%
layout(yaxis = list(title = 'f1_tBodyAccmeanX'), xaxis = list(title = "")) %>%
layout(margin = list(b = 90))
p
# Principal Component Analysis
# About 95% of the variance in the original data is captured by the first five principal components
model_pca <- h2o.prcomp(training_frame = hex_train,
                        x = 2:562,
                        model_id = "h2o_pca",
                        k = 5)
model_pca
Model Details:
==============
H2ODimReductionModel: pca
Model ID: h2o_pca
Importance of components:
                             pc1       pc2       pc3       pc4       pc5
Standard deviation     16.190113  4.587733  1.570451  1.441637  0.980662
Proportion of Variance  0.864715  0.069434  0.008136  0.006856  0.003173
Cumulative Proportion   0.864715  0.934148  0.942285  0.949141  0.952313
H2ODimReductionMetrics: pca
No model metrics available for PCA
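The "Cumulative Proportion" row is the running sum of the "Proportion of Variance" row, which is where the 95% figure for five components comes from. A small base-R check using the values copied from the printed output (the variable names are illustrative):

```r
# Proportion of variance per component, copied from the printed PCA output
prop_var <- c(pc1 = 0.864715, pc2 = 0.069434, pc3 = 0.008136,
              pc4 = 0.006856, pc5 = 0.003173)

# Running sum reproduces the "Cumulative Proportion" row
cum_prop <- cumsum(prop_var)
round(cum_prop, 6)

# Smallest number of components whose cumulative share reaches 95%
n_95 <- which(cum_prop >= 0.95)[1]
unname(n_95)
```

The recomputed sums can differ from the printed row in the last digit, because the printed proportions are already rounded to six decimals.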
# Visualize principal components with activity labels
d_pca <- as.data.frame(h2o.predict(model_pca, hex_train))
d_pca <- data.frame(d_pca, as.data.frame(hex_train$activity))
head(d_pca)
p <- plot_ly(data = d_pca, x = ~PC2, y = ~PC3, color = ~activity,
             type = "scatter", mode = "markers", marker = list(size = 3)) %>%
  layout(title = "Visualizing Principal Components")
p
From the graph above, we can see that:
# Define target and features for model training
target <- "activity"
features <- setdiff(colnames(hex_train), target) # i.e. using all 561 sensor features
# Build a GBM model with cross-validation and early stopping
model <- h2o.gbm(x = features,
                 y = target,
                 training_frame = hex_train,
                 model_id = "h2o_gbm",
                 ntrees = 500,
                 learn_rate = 0.05,
                 learn_rate_annealing = 0.999,
                 max_depth = 7,
                 sample_rate = 0.9,
                 col_sample_rate = 0.9,
                 nfolds = 3,
                 fold_assignment = "Stratified",
                 stopping_metric = "logloss",
                 stopping_rounds = 5,
                 score_tree_interval = 10,
                 seed = 1234)
# Print out model summary
model
Model Details:
==============
H2OMultinomialModel: gbm
Model ID: h2o_gbm
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
1 290 1740 1459657 1 7 6.99655 2 83 55.64828
H2OMultinomialMetrics: gbm
** Reported on training data. **
Training Set Metrics:
=====================
MSE: (Extract with `h2o.mse`) 1.02967e-11
RMSE: (Extract with `h2o.rmse`) 3.208847e-06
Logloss: (Extract with `h2o.logloss`) 1.142173e-06
Mean Per-Class Error: 0
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
LAYING SITTING STANDING WALKING WALKING_DOWNSTAIRS WALKING_UPSTAIRS Error Rate
LAYING 1407 0 0 0 0 0 0.0000 = 0 / 1,407
SITTING 0 1286 0 0 0 0 0.0000 = 0 / 1,286
STANDING 0 0 1374 0 0 0 0.0000 = 0 / 1,374
WALKING 0 0 0 1226 0 0 0.0000 = 0 / 1,226
WALKING_DOWNSTAIRS 0 0 0 0 986 0 0.0000 = 0 / 986
WALKING_UPSTAIRS 0 0 0 0 0 1073 0.0000 = 0 / 1,073
Totals 1407 1286 1374 1226 986 1073 0.0000 = 0 / 7,352
Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
=======================================================================
Top-6 Hit Ratios:
k hit_ratio
1 1 1.000000
2 2 1.000000
3 3 1.000000
4 4 1.000000
5 5 1.000000
6 6 1.000000
H2OMultinomialMetrics: gbm
** Reported on cross-validation data. **
** 3-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
Cross-Validation Set Metrics:
=====================
MSE: (Extract with `h2o.mse`) 0.007344104
RMSE: (Extract with `h2o.rmse`) 0.08569775
Logloss: (Extract with `h2o.logloss`) 0.02944961
Mean Per-Class Error: 0.008763655
Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,xval = TRUE)`
=======================================================================
Top-6 Hit Ratios:
k hit_ratio
1 1 0.990751
2 2 1.000000
3 3 1.000000
4 4 1.000000
5 5 1.000000
6 6 1.000000
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid
accuracy 0.99077135 8.7470456E-4 0.9896743 0.9925 0.99013966
err 0.009228656 8.7470456E-4 0.010325655 0.0075 0.009860313
err_count 22.666666 2.4037008 26.0 18.0 24.0
logloss 0.029395567 0.0018986394 0.03290966 0.02639239 0.02888465
max_per_class_error 0.032861423 0.0031006113 0.034632035 0.03712297 0.026829269
mean_per_class_accuracy 0.9912118 8.688647E-4 0.9903684 0.9929493 0.99031764
mean_per_class_error 0.008788202 8.688647E-4 0.009631551 0.0070507196 0.009682334
mse 0.007332566 5.250083E-4 0.007918376 0.006284995 0.007794328
r2 0.99743503 1.6752906E-4 0.99722123 0.99776536 0.99731845
rmse 0.08551624 0.003125672 0.08898526 0.07927796 0.08828549
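Two numbers in the model summary follow from the training settings: early stopping ended training at 290 of the 500 requested trees, and a multinomial GBM grows one internal tree per class per boosting round. The sketch below reproduces that count and shows roughly how `learn_rate_annealing` shrinks the learning rate over those rounds. It assumes the factor is applied once after each tree, starting from the first; the exact indexing inside H2O may differ.

```r
ntrees_built <- 290     # number_of_trees from the model summary
n_classes <- 6          # six activities
learn_rate <- 0.05
annealing <- 0.999

# One internal tree per class per boosting round
ntrees_built * n_classes

# Assumed schedule: rate multiplied by the annealing factor after each tree
effective_rate <- learn_rate * annealing^(0:(ntrees_built - 1))
round(range(effective_rate), 4)
```

Under this assumption the learning rate decays from 0.05 to about 0.037 by the last tree that was built.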
# Look at variable importance in this GBM model
h2o.varimp(model)
Variable Importances:
variable relative_importance scaled_importance percentage
1 f53_tGravityAccminX 9839.461914 1.000000 0.204901
2 f560_angleYgravityMean 3236.193359 0.328899 0.067392
3 f10_tBodyAccmaxX 3065.950928 0.311597 0.063847
4 f167_tBodyGyroJerkmadX 2081.444580 0.211540 0.043345
5 f41_tGravityAccmeanX 1930.086060 0.196158 0.040193
---
variable relative_importance scaled_importance percentage
556 f494_fBodyGyrobandsEnergy4148 0.008722 0.000001 0.000000
557 f471_fBodyGyrobandsEnergy3348 0.007785 0.000001 0.000000
558 f99_tBodyAccJerkenergyZ 0.004651 0.000000 0.000000
559 f98_tBodyAccJerkenergyY 0.003739 0.000000 0.000000
560 f362_fBodyAccJerkenergyY 0.002965 0.000000 0.000000
561 f548_fBodyBodyGyroJerkMagenergy 0.002210 0.000000 0.000000
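The `scaled_importance` column is each feature's `relative_importance` divided by the largest value in the table. The base-R check below uses the five top rows copied from the output above; the `percentage` column is not recomputed, because it depends on the importances of all 561 features, which are not fully printed.

```r
# Relative importances of the top five features, copied from the output
rel_imp <- c(f53_tGravityAccminX    = 9839.461914,
             f560_angleYgravityMean = 3236.193359,
             f10_tBodyAccmaxX       = 3065.950928,
             f167_tBodyGyroJerkmadX = 2081.444580,
             f41_tGravityAccmeanX   = 1930.086060)

# Scale by the most important feature
scaled <- rel_imp / max(rel_imp)
round(scaled, 6)
```

The top feature is about three times as important as the runner-up, so a small set of gravity-related features drives most of the splits.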
# Visualize variable importance
h2o.varimp_plot(model, num_of_features = 15)
# Make predictions
yhat_test <- h2o.predict(model, hex_test)
head(yhat_test)
# Evaluate predictions
h2o.performance(model, newdata = hex_test)
H2OMultinomialMetrics: gbm
Test Set Metrics:
=====================
MSE: (Extract with `h2o.mse`) 0.06358653
RMSE: (Extract with `h2o.rmse`) 0.2521637
Logloss: (Extract with `h2o.logloss`) 0.3214465
Mean Per-Class Error: 0.07575778
Confusion Matrix: Extract with `h2o.confusionMatrix(<model>, <data>)`)
=========================================================================
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
                   LAYING SITTING STANDING WALKING WALKING_DOWNSTAIRS WALKING_UPSTAIRS  Error Rate
LAYING                537       0        0       0                  0                0 0.0000 = 0 / 537
SITTING                 0     401       89       0                  0                1 0.1833 = 90 / 491
STANDING                0      40      492       0                  0                0 0.0752 = 40 / 532
WALKING                 0       0        0     481                  4               11 0.0302 = 15 / 496
WALKING_DOWNSTAIRS      0       0        0      10                378               32 0.1000 = 42 / 420
WALKING_UPSTAIRS        0       1        0      24                  6              440 0.0658 = 31 / 471
Totals                537     442      581     515                388              484 0.0740 = 218 / 2,947
Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>, <data>)`
=======================================================================
Top-6 Hit Ratios:
k hit_ratio
1 1 0.926026
2 2 0.990159
3 3 0.994231
4 4 0.996607
5 5 1.000000
6 6 1.000000
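The hit ratio at k is the share of records whose true class is among the k classes with the highest predicted probability, so the k = 1 row is ordinary accuracy. The base-R sketch below computes it on a toy example; the three classes, the probabilities, and the labels are all made up for illustration and are unrelated to the model above.

```r
# Hypothetical predicted probabilities: four records (rows), three classes (columns)
classes <- c("A", "B", "C")
probs <- matrix(c(0.7, 0.2, 0.1,
                  0.2, 0.5, 0.3,
                  0.1, 0.3, 0.6,
                  0.5, 0.1, 0.4),
                ncol = 3, byrow = TRUE, dimnames = list(NULL, classes))
actual <- c("A", "C", "C", "B")

# Rank of the true class within each row (1 = highest probability)
rank_of_truth <- sapply(seq_along(actual), function(i) {
  ordered <- names(sort(probs[i, ], decreasing = TRUE))
  match(actual[i], ordered)
})

# Share of records whose true class is within the top k predictions
hit_ratio <- sapply(seq_along(classes), function(k) mean(rank_of_truth <= k))
data.frame(k = seq_along(classes), hit_ratio = hit_ratio)
```

In the test output above, the jump from 0.926 at k = 1 to 0.990 at k = 2 indicates that for most misclassified records the true activity was the model's second choice.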
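As expected:
- It is easy to classify LAYING: all 537 test records are predicted correctly.
- It is difficult to distinguish between SITTING and STANDING: 89 of 491 SITTING records are predicted as STANDING, and 40 of 532 STANDING records as SITTING.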
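The summary metrics reported for the test set can be recomputed from the confusion matrix alone. The base-R sketch below re-enters the counts from the test output (rows are actual classes, columns are predicted classes) and derives the per-class error, overall accuracy, and mean per-class error; the variable names are illustrative.

```r
classes <- c("LAYING", "SITTING", "STANDING", "WALKING",
             "WALKING_DOWNSTAIRS", "WALKING_UPSTAIRS")

# Test-set confusion matrix copied from the output (rows = actual, columns = predicted)
cm <- matrix(c(537,   0,   0,   0,   0,   0,
                 0, 401,  89,   0,   0,   1,
                 0,  40, 492,   0,   0,   0,
                 0,   0,   0, 481,   4,  11,
                 0,   0,   0,  10, 378,  32,
                 0,   1,   0,  24,   6, 440),
             nrow = 6, byrow = TRUE,
             dimnames = list(actual = classes, predicted = classes))

# Error rate per actual class, overall accuracy, and mean per-class error
per_class_error <- 1 - diag(cm) / rowSums(cm)
accuracy <- sum(diag(cm)) / sum(cm)
mean_per_class_error <- mean(per_class_error)

# Share of all test errors caused by SITTING/STANDING confusion
sit_stand <- cm["SITTING", "STANDING"] + cm["STANDING", "SITTING"]
sit_stand_share <- sit_stand / (sum(cm) - sum(diag(cm)))

round(per_class_error, 4)
round(c(accuracy = accuracy, mean_per_class_error = mean_per_class_error,
        sit_stand_share = sit_stand_share), 4)
```

SITTING and STANDING confusions account for 129 of the 218 test errors, which is why those two classes dominate the error rate.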
# Save both models to disk for later use
h2o.saveModel(model_pca, path = "./models")
h2o.saveModel(model, path = "./models")